Demi_Yang - MUSA 5080 Portfolio
  • Home
  • About
  • Weekly notes
    • All notes
    • Week 1
    • Week 2
    • Week 3
    • Week 4
  • Labs
    • Lab 0: dplyr Basics
  • Assignments
    • Assignment 1: Census Data Explorations
    • Assignment 2: Spatial Analysis and Visualization
    • Assignment 4: Spatial Predictive Analysis
    • Assignment 5: Space Time Prediction
  • Mid-term
    • Appendix: Residential Sale Price Prediction
    • Slides: Residential Sale Price Prediction
  • Final
    • Appendix: Crime Count Prediction

On this page

  • Assignment Overview
  • Part 1: Healthcare Access for Vulnerable Populations
  • Part 2: Comprehensive Visualization
  • Part 3: Bring Your Own Data Analysis
  • Finally - A few comments about your incorporation of feedback!
  • Submission Requirements

Assignment 2: Spatial Analysis and Visualization

Healthcare Access and Equity in Pennsylvania

Author

Yuqing(Demi) Yang

Published

December 7, 2025

Assignment Overview

Learning Objectives: - Apply spatial operations to answer policy-relevant research questions - Integrate census demographic data with spatial analysis - Create publication-quality visualizations and maps - Work with spatial data from multiple sources - Communicate findings effectively for policy audiences


Part 1: Healthcare Access for Vulnerable Populations

Research Question

  • Which Pennsylvania counties have the highest proportion of vulnerable populations (elderly + low-income) living far from hospitals?

Your analysis should identify counties that should be priorities for healthcare investment and policy intervention.

Required Analysis Steps

Complete the following analysis, documenting each step with code and brief explanations:

Step 1: Data Collection (5 points)

Load the required spatial data: - Pennsylvania county boundaries - Pennsylvania hospitals (from lecture data) - Pennsylvania census tracts

Your Task:

# Load required packages

library(sf)
library(tidyverse)
library(tigris)
library(tidycensus)
library(scales)
library(patchwork)
library(here)
library(units)
library(kableExtra)
census_api_key("42bf8a20a3df1def380f330cf7edad0dd5842ce6")

# Load spatial data

options(tigris_progress = FALSE, tigris_use_cache = TRUE)
pa_counties <- st_read(here("assignments/assignment_2/data/Pennsylvania_County_Boundaries.shp"),quiet = TRUE)
hospitals <- st_read(here("assignments/assignment_2/data/hospitals.geojson"),quiet = TRUE)
census_tracts <- tracts(state = "PA", cb = TRUE)

# Check that all data loaded correctly

## 1)Make a table for checking

layers <- list(
  pa_counties   = pa_counties,
  hospitals     = hospitals,
  census_tracts = census_tracts
)

qc <- function(x) {
  data.frame(
    is_sf    = inherits(x, "sf"),
    n        = nrow(x),
    geom     = paste(unique(as.character(st_geometry_type(x))), collapse = ", "),
    crs_epsg = st_crs(x)$epsg,
    crs_txt  = st_crs(x)$input,
    invalid  = sum(!st_is_valid(x)),
    empty    = sum(st_is_empty(x)),
    bbox     = paste(round(st_bbox(x), 3), collapse = ", ")
  )
}

qc_table <- do.call(rbind, lapply(layers, qc)) %>%
  tibble::rownames_to_column("Layer")

kable(
  qc_table,
  caption = "Summary of Spatial Layers and Geometry Quality Checks",
  align = "c"
) %>%
  kable_styling(
    full_width = FALSE,
    bootstrap_options = c("striped", "hover", "condensed")
  )
Summary of Spatial Layers and Geometry Quality Checks
Layer is_sf n geom crs_epsg crs_txt invalid empty bbox
pa_counties TRUE 67 MULTIPOLYGON 3857 WGS 84 / Pseudo-Mercator 0 0 -8963376.946, 4825316.222, -8314403.862, 5201412.518
hospitals TRUE 223 POINT 4326 WGS 84 0 0 -80.496, 39.752, -74.867, 42.134
census_tracts TRUE 3445 MULTIPOLYGON 4269 NAD83 0 0 -80.52, 39.72, -74.69, 42.27

Questions to answer:

  • How many hospitals are in your dataset?
    • 223
  • How many census tracts?
    • 3445
  • What coordinate reference system is each dataset in?
    • Pennsylvania county boundaries is in EPSG:3857(WGS 84 / Web-Mercator, meters), Pennsylvania hospitals is in EPSG:4326(WGS 84 geographic, degrees), and Pennsylvania census tracts is in EPSG:4269(NAD83 geographic, degrees).

Step 2: Get Demographic Data

Use tidycensus to download tract-level demographic data for Pennsylvania.

Required variables: - Total population - Median household income - Population 65 years and over (you may need to sum multiple age categories)

Your Task:

# Get demographic data from ACS

## 1)Reach the data
core_vars <- c(
  total_pop     = "B01003_001",
  median_income = "B19013_001"
)

acs_core <- get_acs(
  geography = "tract",
  state     = "PA",
  variables = core_vars,
  year      = 2022,
  survey    = "acs5",
  output    = "wide",
  geometry  = FALSE
) %>%
  transmute(
    GEOID,
    total_pop     = total_popE,
    median_income = median_incomeE
  )

## 2) Calculate the population 65 years and over
age65_vars <- c(sprintf("B01001_%03d", 20:25), sprintf("B01001_%03d", 44:49))

acs_65_wide <- get_acs(
  geography = "tract", state = "PA", variables = age65_vars,
  year = 2022, survey = "acs5"
) %>%
  select(GEOID, variable, estimate) %>%
  pivot_wider(names_from = variable, values_from = estimate)

acs_65plus <- acs_65_wide %>%
  mutate(
    non_na_bins = rowSums(!is.na(across(all_of(age65_vars)))),
    over65 = if_else(
      non_na_bins == 0, NA_real_,
      rowSums(across(all_of(age65_vars)), na.rm = TRUE)
    )
  ) %>%
  select(GEOID, over65)

# Join to tract boundaries
options(tigris_progress = FALSE, tigris_use_cache = TRUE)

pa_tracts <- tracts(state = "PA", year = 2022, cb = TRUE, class = "sf")

pa_tracts_joined <- pa_tracts %>%
  left_join(acs_core,   by = "GEOID") %>%
  left_join(acs_65plus, by = "GEOID") %>%
  mutate(
    over65_share = if_else(!is.na(total_pop) & total_pop > 0, over65 / total_pop, NA_real_)
  )

## 2) Count the missing data and medium data of income
income_na_n   <- sum(is.na(pa_tracts_joined$median_income))
income_median_overall <- median(pa_tracts_joined$median_income, na.rm = TRUE)

## 3) Remove missing values
pa_tracts_demo <- pa_tracts_joined %>%
  filter(!is.na(median_income), !is.na(over65_share))

Questions to answer:

  • What year of ACS data are you using?
    • 2022
  • How many tracts have missing income data?
      1. Mostly because household = 0, or household is extremely small.
  • What is the median income across all PA census tracts?
    • 70188

Step 3: Define Vulnerable Populations

Identify census tracts with vulnerable populations based on TWO criteria: 1. Low median household income (choose an appropriate threshold) 2. Significant elderly population (choose an appropriate threshold)

Your Task:

# Filter for vulnerable tracts based on your criteria

## 1) thresholds
income_threshold  <- quantile(pa_tracts_demo$median_income, 0.25, na.rm = TRUE, names = FALSE)
elderly_q75       <- quantile(pa_tracts_demo$over65_share, 0.75, na.rm = TRUE, names = FALSE)

## 2) flags + vulnerable
pa_tracts_demo <- pa_tracts_demo %>%
  mutate(
    low_income   = median_income <= income_threshold,
    high_elderly = over65_share >= elderly_q75,  
    vulnerable   = low_income & high_elderly
  )

## 3) quick outputs
elderly_q75
[1] 0.2314351
sum(pa_tracts_demo$vulnerable, na.rm = TRUE)
[1] 165
round(mean(pa_tracts_demo$vulnerable, na.rm = TRUE) * 100, 1)
[1] 4.9

Questions to answer:

  • What income threshold did you choose and why?
    • The low-income threshold = tract median income ≤ state 25th percentile.A percentile-based, within-state threshold is robust to outliers and gives a clear “most disadvantaged” group.
  • What elderly population threshold did you choose and why?
    • The aging threshold = 65+ share ≥ state 75th percentile.Using percentile cutoffs to identify the “upper tail” aligns with screening practice.Both thresholds follow a percentile-ranking approach widely used by federal screening tools (SVI/EJScreen).
  • How many tracts meet your vulnerability criteria?
    • 165
  • What percentage of PA census tracts are considered vulnerable by your definition?
    • 4.9%

Step 4: Calculate Distance to Hospitals

For each vulnerable tract, calculate the distance to the nearest hospital.

Your Task:

# Transform to appropriate projected CRS

## 1) Copy and reproject (without modifying the original object)
distance_crs <- 5070
pa_tracts_demo_5070 <- st_transform(pa_tracts_demo, distance_crs)
hospitals_5070 <- st_transform(hospitals, distance_crs)

## 2) Keep only vulnerable tracts
vuln_5070 <- pa_tracts_demo_5070 %>%
  dplyr::filter(vulnerable %in% TRUE)

stopifnot(nrow(hospitals_5070) > 0)
stopifnot(nrow(vuln_5070) > 0)

# Calculate distance from each tract centroid to nearest hospital

## 1) Use an interior representative point
vuln_pts_5070 <- st_centroid(vuln_5070)

## 2) Pairwise distance matrix (meters)
dist_mat_m <- st_distance(vuln_pts_5070, hospitals_5070)

## 3) Distance: tract centroid -> nearest hospital
nearest_idx <- apply(dist_mat_m, 1, function(x) which.min(as.numeric(x)))
min_dist_m <- apply(dist_mat_m, 1, function(x) min(as.numeric(x), na.rm = TRUE))

## 4) meters → miles
dist_mi <- min_dist_m / 1609.344

## 5) Attach back to vulnerable tracts
vulnerable_tracts_dist_5070 <- vuln_5070 %>%
  mutate(
    nearest_hospital_row = nearest_idx,
    dist_to_hospital_mi  = dist_mi)


avg_dist_mi  <- mean(vulnerable_tracts_dist_5070$dist_to_hospital_mi, na.rm = TRUE)
max_dist_mi  <- max(vulnerable_tracts_dist_5070$dist_to_hospital_mi, na.rm = TRUE)
over15_count <- sum(vulnerable_tracts_dist_5070$dist_to_hospital_mi > 15, na.rm = TRUE)

cat(
  "Average distance to nearest hospital (miles): ", round(avg_dist_mi, 2), "\n",
  "Maximum distance (miles): ",                   round(max_dist_mi, 2), "\n",
  "Vulnerable tracts > 15 miles: ",               over15_count, "\n",
  sep = ""
)
Average distance to nearest hospital (miles): 4.75
Maximum distance (miles): 19.32
Vulnerable tracts > 15 miles: 10

Requirements:

  • Use an appropriate projected coordinate system for Pennsylvania
  • Calculate distances in miles
  • Explain why you chose your projection
    • I use EPSG:5070 (NAD83 / Conus Albers) because it provides low distortion across the conterminous U.S., so distances computed statewide in Pennsylvania are much more reliable than in Web Mercator (distorts scale) or unprojected WGS84 (degrees). It also uses meters, making distance and area calculations straightforward, and it avoids zone-splitting issues you’d hit with StatePlane or UTM.

Questions to answer:

  • What is the average distance to the nearest hospital for vulnerable tracts?
    • 4.75
  • What is the maximum distance?
    • 19.32
  • How many vulnerable tracts are more than 15 miles from the nearest hospital?
    • 10

Step 5: Identify Underserved Areas

Define “underserved” as vulnerable tracts that are more than 15 miles from the nearest hospital.

Your Task:

# Create underserved variable

vulnerable_tracts_dist_5070 <- vulnerable_tracts_dist_5070 %>%
  dplyr::mutate(underserved = dist_to_hospital_mi > 15)


sum(vulnerable_tracts_dist_5070$underserved, na.rm = TRUE)
[1] 10
round(mean(vulnerable_tracts_dist_5070$underserved, na.rm = TRUE) * 100, 1)
[1] 6.1

Questions to answer:

  • How many tracts are underserved?
    • 10
  • What percentage of vulnerable tracts are underserved?
    • 6.1%
  • Does this surprise you? Why or why not?
    • Hospitals cluster in metro areas, so it’s expected that some rural, low-density tracts fall farther from facilities; under my definition, only ~6% of vulnerable tracts are underserved, which suggests overall access is fairly good. That said, pockets of concern remain, and distance alone doesn’t capture barriers like transportation or service capacity.

Step 6: Aggregate to County Level

Use spatial joins and aggregation to calculate county-level statistics about vulnerable populations and hospital access.

Your Task:

# Spatial join tracts to counties

pa_counties_5070 <- st_transform(pa_counties, 5070)

vuln_with_county <- vulnerable_tracts_dist_5070 %>%
  st_join(pa_counties_5070 %>% select(COUNTY_NAM))

# Aggregate statistics by county

county_summary <- vuln_with_county %>%
  st_drop_geometry() %>%
  group_by(COUNTY_NAM) %>%
  summarise(
    vulnerable_tracts          = n(),
    underserved_tracts         = sum(underserved, na.rm = TRUE),
    pct_underserved            = 100 * underserved_tracts / vulnerable_tracts,
    avg_dist_mi_vulnerable     = mean(dist_to_hospital_mi, na.rm = TRUE),
    vulnerable_population      = sum(total_pop, na.rm = TRUE),
    vulnerable_pop_far_mi15    = sum(ifelse(underserved, total_pop, 0), na.rm = TRUE),
    .groups = "drop")
county_summary %>%
  arrange(desc(pct_underserved)) %>%
  select(COUNTY_NAM, vulnerable_tracts, underserved_tracts, pct_underserved) %>%
  slice_head(n = 5)
# A tibble: 5 × 4
  COUNTY_NAM vulnerable_tracts underserved_tracts pct_underserved
  <chr>                  <int>              <int>           <dbl>
1 BRADFORD                   1                  1             100
2 COLUMBIA                   1                  1             100
3 JUNIATA                    1                  1             100
4 MONROE                     1                  1             100
5 PERRY                      2                  2             100
county_summary %>%
  arrange(desc(vulnerable_pop_far_mi15)) %>%
  select(COUNTY_NAM, vulnerable_population, vulnerable_pop_far_mi15) %>%
  slice_head(n = 5) 
# A tibble: 5 × 3
  COUNTY_NAM     vulnerable_population vulnerable_pop_far_mi15
  <chr>                          <dbl>                   <dbl>
1 PERRY                           5800                    5800
2 BRADFORD                        5466                    5466
3 CLEARFIELD                     16232                    5189
4 DAUPHIN                         8410                    4018
5 NORTHUMBERLAND                 13105                    4018

Required county-level statistics:

  • Number of vulnerable tracts
  • Number of underserved tracts
  • Percentage of vulnerable tracts that are underserved
  • Average distance to nearest hospital for vulnerable tracts
  • Total vulnerable population

Questions to answer:

  • Which 5 counties have the highest percentage of underserved vulnerable tracts?
    • BRADFORD,COLUMBIA,JUNIATA,MONROE,PERRY
  • Which counties have the most vulnerable people living far from hospitals?
    • PERRY(5800),BRADFORD(5466),CLEARFIELD(5189),DAUPHIN(4018),NORTHUMBERLAND(4018)
  • Are there any patterns in where underserved counties are located?
    • Underserved counties tend to cluster in rural and lower-density parts of Pennsylvania, where hospital coverage is sparse and facilities are concentrated in metro hubs. You’ll often see higher shares along the northern/central rural corridor and Appalachian areas, reflecting longer travel distances and thinner provider networks relative to population.

Step 7: Create Summary Table

Create a professional table showing the top 10 priority counties for healthcare investment.

Your Task:

# Create and format priority counties table

priority_tbl <- county_summary %>%
  mutate(
    # safety: replace any NAs used in scoring
    across(c(vulnerable_pop_far_mi15, pct_underserved, avg_dist_mi_vulnerable,
             vulnerable_population, vulnerable_tracts, underserved_tracts),
           ~ tidyr::replace_na(., 0)),
    # Priority metric
    priority_score = 0.60 * rescale(vulnerable_pop_far_mi15) +
                     0.25 * rescale(pct_underserved) +
                     0.15 * rescale(avg_dist_mi_vulnerable)
  ) %>%
  arrange(desc(priority_score)) %>%
  slice_head(n = 10) %>%
  transmute(
    County                         = COUNTY_NAM,
    `Vulnerable tracts`            = vulnerable_tracts,
    `Underserved tracts`           = underserved_tracts,
    `% vulnerable underserved`     = percent(pct_underserved / 100, accuracy = 0.1),
    `Avg distance (mi)`            = number(avg_dist_mi_vulnerable, accuracy = 0.1),
    `Vulnerable people >15 mi`     = comma(vulnerable_pop_far_mi15),
    `Priority score`               = number(priority_score, accuracy = 0.01)
  )

# Nicely formatted table
knitr::kable(
  priority_tbl,
  caption = "Top 10 Priority Pennsylvania Counties for Healthcare Investment
  (Priority Score weights: 60% vulnerable people >15 mi, 25% % vulnerable underserved, 15% average distance)",
  booktabs = TRUE
) |>
  kableExtra::kable_styling(full_width = FALSE, font_size = 12)
Top 10 Priority Pennsylvania Counties for Healthcare Investment (Priority Score weights: 60% vulnerable people >15 mi, 25% % vulnerable underserved, 15% average distance)
County Vulnerable tracts Underserved tracts % vulnerable underserved Avg distance (mi) Vulnerable people >15 mi Priority score
PERRY 2 2 100.0% 17.7 5,800 1.00
BRADFORD 1 1 100.0% 16.7 5,466 0.96
CLEARFIELD 5 2 40.0% 10.1 5,189 0.72
DAUPHIN 2 1 50.0% 10.0 4,018 0.62
NORTHUMBERLAND 5 1 20.0% 12.8 4,018 0.57
JUNIATA 1 1 100.0% 16.0 1,782 0.57
MONROE 1 1 100.0% 17.6 1,299 0.53
FOREST 2 1 50.0% 15.2 2,701 0.53
ELK 4 1 25.0% 11.0 3,343 0.50
COLUMBIA 1 1 100.0% 16.9 918 0.49

Requirements: - Use knitr::kable() or similar for formatting - Include descriptive column names - Format numbers appropriately (commas for population, percentages, etc.) - Add an informative caption - Sort by priority (you decide the metric)


Part 2: Comprehensive Visualization

Using the skills from Week 3 (Data Visualization), create publication-quality maps and charts.

Map 1: County-Level Choropleth

Create a choropleth map showing healthcare access challenges at the county level.

Your Task:

# Create county-level access map

# 1) Reproject display layers to WGS84 (EPSG:4326)
crs_display <- 4326
pa_counties_4326 <- st_transform(pa_counties, crs_display)
hospitals_4326   <- st_transform(hospitals, crs_display)

# 2) Join county stats to county geometries
county_map_4326 <- pa_counties_4326 %>%
  left_join(
    county_summary %>%
      mutate(pct_underserved_prop = pct_underserved / 100), 
    by = "COUNTY_NAM"
  )

# 3) Choropleth (WGS84) + hospital points
ggplot(county_map_4326) +
  geom_sf(aes(fill = pct_underserved_prop), color = "white", linewidth = 0.2) +
  geom_sf(data = hospitals_4326, shape = 21, fill = "white", color = "black",
          size = 1.1, alpha = 0.8, inherit.aes = FALSE) +
  scale_fill_viridis_c(
    option = "cividis",
    na.value = "grey90",
    name = "Underserved share\n(of vulnerable tracts)",
    labels = label_percent(accuracy = 1)
  ) +
  labs(
    title    = "Healthcare Access Challenges in Pennsylvania",
    subtitle = "Counties filled by % of vulnerable tracts > 15 miles from a hospital",
    caption  = "Sources: ACS 5-year (2022), lecture hospital dataset. Distances computed in EPSG:5070; map shown in WGS84."
  ) +
  theme_void() +
  theme(
    legend.position = "right",
    plot.title      = element_text(face = "bold"),
    plot.caption    = element_text(hjust = 0)
  )

Requirements: - Fill counties by percentage of vulnerable tracts that are underserved - Include hospital locations as points - Use an appropriate color scheme - Include clear title, subtitle, and caption - Use theme_void() or similar clean theme - Add a legend with formatted labels


Map 2: Detailed Vulnerability Map

Create a map highlighting underserved vulnerable tracts.

Your Task:

# Create detailed tract-level map

## 1)Ensure "underserved" exists; keep only vulnerable tracts and make a status label
vuln_dist_4326 <- vulnerable_tracts_dist_5070 %>%
  mutate(underserved = if (!"underserved" %in% names(.)) dist_to_hospital_mi > 15 else underserved) %>%
  st_transform(crs_display) %>%
  mutate(
    status = ifelse(underserved, "Underserved vulnerable", "Other vulnerable")
  )

## 2)Map
tracts_4326 <- sf::st_transform(census_tracts, 4326)
ggplot() +
  # county boundaries (context; subtle)
  geom_sf(data = pa_counties_4326, fill = NA, color = "grey40", linewidth = 0.3) +
  
  # tracts
  geom_sf(data = tracts_4326, fill = NA, color = "grey80", linewidth = 0.05) +
  
  # vulnerable tracts
  geom_sf(data = subset(vuln_dist_4326, status == "Other vulnerable"),
  aes(fill = status), color = "black", linewidth = 0.15, alpha = 0.9) +
  
  # underserved vulnerable tracts
  geom_sf(data = subset(vuln_dist_4326, status == "Underserved vulnerable"),
  aes(fill = status), color = "black", linewidth = 0.15, alpha = 0.9) +
  
  # hospitals (points)
  geom_sf(
  data = hospitals_4326,
  aes(shape = "Hospitals"),
  fill = "white", color = "black",
  size = 1.2, alpha = 0.9,
  inherit.aes = FALSE,
  show.legend = "point"
) +
  
  # fills for tracts
  scale_fill_manual(
    name = NULL,
    values = c("Other vulnerable" = "#9ecae1", "Underserved vulnerable" = "#08519c")
  ) +
  # shape for hospitals + legend tweaks
  scale_shape_manual(name = NULL, values = c("Hospitals" = 21)) +
  guides(
    fill  = guide_legend(order = 1, override.aes = list(shape = NA)),
    shape = guide_legend(order = 2, override.aes = list(fill = "white", color = "black", size = 2))
  ) +
  labs(
    title    = "Underserved Vulnerable Census Tracts in Pennsylvania",
    subtitle = "Underserved = vulnerable tracts (>15 miles to nearest hospital); counties and tract grids shown for context",
    caption  = "Sources: ACS 5-year (2022); hospital dataset. Distances: EPSG:5070; map: WGS84."
  ) +
  theme_void() +
  theme(legend.position = "right", plot.title = element_text(face = "bold"))

Requirements:

  • Show underserved vulnerable tracts in a contrasting color

  • Include county boundaries for context

  • Show hospital locations

  • Use appropriate visual hierarchy (what should stand out?)

  • Include informative title and subtitle


Chart: Distribution Analysis

Create a visualization showing the distribution of distances to hospitals for vulnerable populations.

Your Task:

# Create distribution visualization

## A small helper used in captions
median_dist <- median(vulnerable_tracts_dist_5070$dist_to_hospital_mi, na.rm = TRUE)

## 1) People-weighted histogram of distances (highlighting the 15-mile threshold)
vulnerable_tracts_dist_5070 |> 
  mutate(beyond15 = factor(dist_to_hospital_mi > 15,
                           levels = c(FALSE, TRUE),
                           labels = c("≤ 15 miles", "> 15 miles"))) |>
  ggplot(aes(x = dist_to_hospital_mi, weight = total_pop, fill = beyond15)) +
  geom_histogram(binwidth = 1, boundary = 0, color = "white") +
  scale_fill_manual(values = c("≤ 15 miles" = "#9ecae1", "> 15 miles" = "#08519c"),
                    name = NULL) +
  scale_y_continuous(labels = comma) +
  labs(
    title = "How far are vulnerable residents from hospitals?",
    x = "Distance to nearest hospital (miles)",
    y = "Residents (people)"
  ) +
  theme_minimal(base_size = 12) +
  theme(legend.position = "right")

# 2) Scatter: Distance vs. tract population (vulnerable tracts)
ggplot(vulnerable_tracts_dist_5070,
       aes(x = dist_to_hospital_mi, y = total_pop, color = underserved)) +
  geom_point(alpha = 0.7) +
  scale_y_continuous(labels = comma, trans = "log10") +
  scale_color_manual(values = c(`FALSE` = "#9ecae1", `TRUE` = "#08519c"),
                     name = "Underserved\n(>15 mi)") +
  labs(
    title = "Distance vs. population size across vulnerable tracts",
    x = "Distance to nearest hospital (miles)",
    y = "Tract population (people)"
  ) +
  theme_minimal(base_size = 12) +
  theme(legend.position = "right")

Suggested chart types:

  • Histogram or density plot of distances

  • Box plot comparing distances across regions

  • Bar chart of underserved tracts by county

  • Scatter plot of distance vs. vulnerable population size

Requirements:

  • Clear axes labels with units

  • Appropriate title

  • Professional formatting

  • Brief interpretation (1-2 sentences as a caption or in text)

    • For histogram: Most vulnerable residents live within about %.1f miles; the right tail beyond 15 miles is relatively small.
    • For scatter plot: Most high-population vulnerable tracts are within ~15 miles; the far-distance outliers tend to be lower-density.

Part 3: Bring Your Own Data Analysis

Choose your own additional spatial dataset and conduct a supplementary analysis.

Challenge Options

Choose ONE of the following challenge exercises, or propose your own research question using OpenDataPhilly data (https://opendataphilly.org/datasets/).


Education & Youth Services

Option A: Educational Desert Analysis - Data: Schools, Libraries, Recreation Centers, Census tracts (child population) - Question: “Which neighborhoods lack adequate educational infrastructure for children?” - Operations: Buffer schools/libraries (0.5 mile walking distance), identify coverage gaps, overlay with child population density - Policy relevance: School district planning, library placement, after-school program siting

Option B: School Safety Zones - Data: Schools, Crime Incidents, Bike Network - Question: “Are school zones safe for walking/biking, or are they crime hotspots?” - Operations: Buffer schools (1000ft safety zone),spatial join with crime incidents, assess bike infrastructure coverage - Policy relevance: Safe Routes to School programs, crossing guard placement


Environmental Justice

Option C: Green Space Equity - Data: Parks, Street Trees, Census tracts (race/income demographics) - Question: “Do low-income and minority neighborhoods have equitable access to green space?” - Operations: Buffer parks (10-minute walk = 0.5 mile), calculate tree canopy or park acreage per capita, compare by demographics - Policy relevance: Climate resilience, environmental justice, urban forestry investment —

Public Safety & Justice

Option D: Crime & Community Resources - Data: Crime Incidents, Recreation Centers, Libraries, Street Lights - Question: “Are high-crime areas underserved by community resources?” - Operations: Aggregate crime counts to census tracts or neighborhoods, count community resources per area, spatial correlation analysis - Policy relevance: Community investment, violence prevention strategies —

Infrastructure & Services

Option E: Polling Place Accessibility - Data: Polling Places, SEPTA stops, Census tracts (elderly population, disability rates) - Question: “Are polling places accessible for elderly and disabled voters?” - Operations: Buffer polling places and transit stops, identify vulnerable populations, find areas lacking access - Policy relevance: Voting rights, election infrastructure, ADA compliance


Health & Wellness

Option F: Recreation & Population Health - Data: Recreation Centers, Playgrounds, Parks, Census tracts (demographics) - Question: “Is lack of recreation access associated with vulnerable populations?” - Operations: Calculate recreation facilities per capita by neighborhood, buffer facilities for walking access, overlay with demographic indicators - Policy relevance: Public health investment, recreation programming, obesity prevention


Emergency Services

Option G: EMS Response Coverage - Data: Fire Stations, EMS stations, Population density, High-rise buildings - Question: “Are population-dense areas adequately covered by emergency services?” - Operations: Create service area buffers (5-minute drive = ~2 miles), assess population coverage, identify gaps in high-density areas - Policy relevance: Emergency preparedness, station siting decisions


Arts & Culture

Option H: Cultural Asset Distribution - Data: Public Art, Museums, Historic sites/markers, Neighborhoods - Question: “Do all neighborhoods have equitable access to cultural amenities?” - Operations: Count cultural assets per neighborhood, normalize by population, compare distribution across demographic groups - Policy relevance: Cultural equity, tourism, quality of life, neighborhood identity


Data Sources

OpenDataPhilly: https://opendataphilly.org/datasets/ - Most datasets available as GeoJSON, Shapefile, or CSV with coordinates - Always check the Metadata for a data dictionary of the fields.

Additional Sources: - Pennsylvania Open Data: https://data.pa.gov/ - Census Bureau (via tidycensus): Demographics, economic indicators, commute patterns - TIGER/Line (via tigris): Geographic boundaries

Recommended Starting Points

If you’re feeling confident: Choose an advanced challenge with multiple data layers. If you are a beginner, choose something more manageable that helps you understand the basics

If you have a different idea: Propose your own question! Just make sure: - You can access the spatial data - You can perform at least 2 spatial operations

My Analysis

My Task:

  1. Find and load additional data
    • Document your data source
    • Check and standardize the CRS
    • Provide basic summary statistics
# Load your additional dataset

## 1) Philly tracts

options(tigris_progress = FALSE, tigris_use_cache = TRUE)
phl_tracts <- tracts(state = "PA", county = "Philadelphia", cb = TRUE, year = 2022)

phl_boundary <- counties(state = "PA", cb = TRUE, year = 2022) %>%
  filter(NAME == "Philadelphia")

## 2)Reach children data
phl_children <- get_acs(
  geography = "tract",
  variables = c(child_pop = "B09001_001"),
  state = "PA",
  county = "Philadelphia",
  year = 2022,
  output = "wide"
)

tracts_with_children <- phl_tracts %>%
  left_join(phl_children, by = "GEOID")

## 3)Reach POI
### recreation center raw data(philly)
PPR_Program_sites <- st_read("https://opendata.arcgis.com/api/v3/datasets/9eb26a787a6e448ba426eea7f9f0d93a_0/downloads/data?format=geojson&spatialRefId=4326", quiet = TRUE)

### schools raw data(philly)
schools <- st_read("https://hub.arcgis.com/api/v3/datasets/d46a7e59e2c246c891fbee778759717e_0/downloads/data?format=geojson&spatialRefId=4326&where=1%3D1", quiet = TRUE)

### libraries raw data(global)
libraries <- st_read( "data/Libraries/Libraries.gpkg",layer = "parsed", quiet = TRUE)
Integer64 values larger than 9.0072e+15 lost significance after conversion to double;
use argument int64_as_string = TRUE to import them lossless, as character
## 4) Check the data

summary_rows <- tibble::tibble(
  Dataset = c("Census Tracts (phl_tracts)", 
              "Children ACS (phl_children)", 
              "Schools", 
              "Libraries"),
  Rows = c(
    nrow(phl_tracts),
    nrow(phl_children),
    nrow(schools),
    nrow(libraries)
  )
)

kable(
  summary_rows,
  caption = "Summary of Raw Datasets Loaded",
  align = "c"
) %>%
  kable_styling(
    full_width = FALSE,
    bootstrap_options = c("striped", "hover", "condensed"),
    position = "center"
  )
Summary of Raw Datasets Loaded
Dataset Rows
Census Tracts (phl_tracts) 408
Children ACS (phl_children) 408
Schools 496
Libraries 81082
# Data cleaning
## 1)Recreation center
names(PPR_Program_sites)

### Check the unique values containing "Recreation" in the field and see how they are written
unique(unlist(PPR_Program_sites))
unique(PPR_Program_sites$FACILITY_TYPE)

### Filter"Recreation Centers"
recreation_centers <- PPR_Program_sites %>%
  filter(
    grepl("recreation", park_name, ignore.case = TRUE)
  )

### See results
nrow(recreation_centers)
head(recreation_centers)

## 2)Libraries
### reproject CRS of libraries to boundary's 
st_crs(libraries)
st_crs(phl_boundary)
libraries <- st_transform(libraries, st_crs(phl_boundary))

### Cut out the libraries within the Philly area
libraries_phl <- st_intersection(libraries, phl_boundary)

### See results
nrow(libraries_phl)
head(libraries_phl)


## 3)Summary table
summary_table <- tibble(
  Dataset = c("Libraries", 
              "Schools", 
              "Recreation Centers"),
  Features = c(nrow(libraries_phl), 
               nrow(schools), 
               nrow(recreation_centers)),
  CRS = c(st_crs(libraries)$input, 
          st_crs(schools)$input, 
          st_crs(PPR_Program_sites)$input)
)

kable(summary_table, caption = "Summary of Datasets Used")

Questions to answer:

  • What dataset did you choose and why?
    • I selected three datasets that represent educational and community infrastructure in Philadelphia: Libraries; Schools; Recreation Centers.
  • What is the data source and date?
    • Libraries: Extracted from OpenStreetMap (via global GeoPackage export, likely updated in 2024–2025).
    • Schools: From OpenDataPhilly(https://opendataphilly.org/datasets/schools/?utm_source=chatgpt.com) – Schools dataset, maintained by the City of Philadelphia.
    • Recreation Centers: From Philadelphia Parks & Recreation Program Sites dataset, also hosted on OpenDataPhilly.
    • All datasets were accessed in October. 2025.
  • How many features does it contain?
    • For libraries data, it has 81082 features at beginning, containing the libraries all over the world. After filter the libraries in Philly, the number of features shrinks to 80.
    • For schools, there are 495 schools(not including universities and colleges) in Philly.
    • For recreation centers, there are 67 of them in Philly.
  • What CRS is it in? Did you need to transform it?
    • Libraries data is in NAD 83.
    • Schools data is in WGS 84.
    • Recreations data is in WGS 84.

  1. Pose a research question

Write a clear research statement that your analysis will answer.

  • Which neighborhoods in Philadelphia lack adequate educational infrastructure for children?

  1. Conduct spatial analysis

Use at least TWO spatial operations to answer your research question.

Required operations (choose 2+): - Buffers - Spatial joins - Spatial filtering with predicates - Distance calculations - Intersections or unions - Point-in-polygon aggregation

Your Task:

# Unify the coordinates and prepare the buffer

## 1) Reproject all layers to 5070
crs_analysis <- 5070
crs_map      <- 4326
half_mile_m  <- set_units(0.3, "mile") %>% set_units("m") %>% drop_units()

phl_tracts_5070        <- st_transform(phl_tracts, crs_analysis)
phl_boundary_5070      <- st_transform(phl_boundary, crs_analysis)
schools_5070           <- st_transform(schools, crs_analysis)
recreation_centers_5070<- st_transform(recreation_centers, crs_analysis)
libraries_phl_5070     <- st_transform(libraries_phl, crs_analysis)


## 2) Create buffer(schools & libraries, recreation centers do sensitivity)
buf_schools   <- st_buffer(schools_5070,  dist = half_mile_m)
buf_libraries <- st_buffer(libraries_phl_5070, dist = half_mile_m)

## 3) Accessible region
access_union_core <- st_union(st_union(buf_schools), st_union(buf_libraries))

## 4) Use recreation center to sensitivity comparison
buf_recreation <- st_buffer(recreation_centers_5070, dist = half_mile_m)
access_union_all <- st_union(access_union_core, st_union(buf_recreation))

## 5)Check the result
st_crs(schools_5070)
st_crs(libraries_phl_5070)
st_crs(phl_tracts_5070)

ggplot() +
  geom_sf(data = phl_boundary_5070, fill = NA, color = "black") +
  geom_sf(data = buf_schools, fill = "darkslateblue", alpha = 0.3) +
  geom_sf(data = buf_libraries, fill = "steelblue", alpha = 0.3) +
  labs(title = "0.5-mile Buffers Around Schools and Libraries")

# Calculate the "education coverage rate" and "child density" of each census tracts

## 1) Area (m²) and child density (per square kilometer)

tracts_metrics <- phl_tracts_5070 %>%
  mutate(tract_area_m2 = as.numeric(st_area(geometry)),
         tract_area_km2 = tract_area_m2 / 1e6) %>%
  left_join(tracts_with_children %>% st_drop_geometry() %>%
              select(GEOID, child_pop = child_popE), by = "GEOID") %>%
  mutate(child_density = ifelse(tract_area_km2 > 0, child_pop / tract_area_km2, NA_real_))

## 2) Coverage rate = Intersection area between the reachable zone and the census zone (m²)
intersect_access <- st_intersection(
  tracts_metrics %>% select(GEOID),
  st_make_valid(access_union_core)
) %>%
  mutate(access_area_m2 = as.numeric(st_area(geometry))) %>%
  st_drop_geometry() %>%
  group_by(GEOID) %>%
  summarize(access_area_m2 = sum(access_area_m2, na.rm = TRUE), .groups = "drop")

tracts_metrics <- tracts_metrics %>%
  left_join(intersect_access, by = "GEOID") %>%
  mutate(access_area_m2 = coalesce(access_area_m2, 0),
         access_ratio   = pmin(access_area_m2 / tract_area_m2, 1))

## 3)Check the statistical distribution of child_density and access_ratio
summary(tracts_metrics[, c("child_density", "access_ratio")])
 child_density      access_ratio             geometry  
 Min.   :    0.0   Min.   :0.0000   MULTIPOLYGON :408  
 1st Qu.:  565.8   1st Qu.:0.5494   epsg:5070    :  0  
 Median : 1276.6   Median :0.8909   +proj=aea ...:  0  
 Mean   : 1495.1   Mean   :0.7438                      
 3rd Qu.: 2175.0   3rd Qu.:1.0000                      
 Max.   :10064.8   Max.   :1.0000                      
ggplot(tracts_metrics) +
  geom_histogram(aes(x = child_density), bins = 30, fill = "lightblue", color = "white") +
  labs(title = "Distribution of Child Density (children/km²)")

ggplot(tracts_metrics) +
  geom_histogram(aes(x = access_ratio), bins = 30, fill = "darkseagreen3", color = "white") +
  labs(title = "Distribution of Access Ratio (coverage rate)")

ggplot() +
  geom_sf(data = tracts_metrics, aes(fill = child_density), color = NA) +
  scale_fill_viridis_c(option="mako", direction=-1) +
  labs(title = "Child Density by Census Tract")

ggplot() +
  geom_sf(data = tracts_metrics, aes(fill = access_ratio), color = NA) +
  scale_fill_viridis_c(option="mako", direction=-1) +
  labs(title = "Education Access Ratio by Census Tract")

# Define education desert

p_target <- 0.25

tracts_metrics <- tracts_metrics %>%
  mutate(
    z_child   = as.numeric(scale(child_density)), 
    z_access  = as.numeric(scale(-access_ratio)),
    score     = 0.5 * z_child + 0.5 * z_access 
  ) %>%
  mutate(
    cutoff = quantile(score, 1 - p_target, na.rm = TRUE),
    vulnerable_local = score >= cutoff
  )


cutoff_value <- unique(tracts_metrics$cutoff)
count_table  <- table(tracts_metrics$vulnerable_local)
percent_vuln <- mean(tracts_metrics$vulnerable_local, na.rm = TRUE)

summary_vulnerable <- tibble(
  Metric = c("Cutoff Score", 
             "Tracts classified as TRUE (Educational Desert)", 
             "Tracts classified as FALSE",
             "Share of Desert Tracts"),
  Value = c(
    round(cutoff_value, 3),
    as.integer(count_table["TRUE"]),
    as.integer(count_table["FALSE"]),
    paste0(round(percent_vuln * 100, 1), "%")
  )
)

kable(summary_vulnerable, caption = "Summary of Educational Desert Classification")
Summary of Educational Desert Classification
Metric Value
Cutoff Score 0.352
Tracts classified as TRUE (Educational Desert) 102
Tracts classified as FALSE 306
Share of Desert Tracts 25%
# Draw the map

## 1) Back to WGS84
tracts_map           <- st_transform(tracts_metrics, crs_map)
phl_boundary_plot    <- st_transform(phl_boundary_5070, crs_map)
schools_plot         <- st_transform(schools, crs_map) %>% mutate(Type = "Schools")
libraries_plot       <- st_transform(libraries_phl, crs_map) %>% mutate(Type = "Libraries")

facilities <- dplyr::bind_rows(
  dplyr::select(schools_plot,  Type),
  dplyr::select(libraries_plot, Type)
)

## 2) map
ggplot() +
  geom_sf(
    data = tracts_map, 
    aes(fill = vulnerable_local), 
    color = "grey20", size = 0.15
  ) +
  geom_sf(
    data = phl_boundary_plot, 
    fill = NA, color = "black", linewidth = 0.4
  ) +
  geom_sf(
    data = facilities,
    aes(color = Type),
    shape = 16,
    size = 1,
    alpha = 0.3,
    show.legend = TRUE
  ) +
  scale_fill_manual(
    values = c("FALSE" = "whitesmoke", "TRUE" = "wheat"),
    name = "Educational desert",
    labels = c("False", "True")
  ) +
  scale_color_manual(
    name = "Facilities",
    values = c("Schools" = "sienna", "Libraries" = "cornsilk4", alpha = 0.3)
  ) +
  guides(
    color = guide_legend(override.aes = list(shape = 16, size = 3, alpha = 1))
  ) +
  labs(
    title = "Educational Desert in Philadelphia",
    subtitle = "Areas with high child density and low coverage of schools & libraries",
    caption = "CRS for map = WGS84"
  ) +
  theme_minimal() +
  theme(
    legend.position = "right",
    axis.text = element_blank(),
    axis.ticks = element_blank(),
    panel.grid = element_blank()
  )

legend_table <- tibble(
  Category = c("Educational Desert", "Schools", "Libraries"),
  Meaning = c("High child density, low access", 
              "Education facilities", 
              "Public libraries providing learning access")
)
kable(legend_table, caption = "Legend Explanation")
Legend Explanation
Category Meaning
Educational Desert High child density, low access
Schools Education facilities
Libraries Public libraries providing learning access
# Sensitivity test (add Recreation Centers)

## 1) Calculate the coverage rate after adding recreation centers"
intersect_access_all <- st_intersection(
  tracts_metrics %>% select(GEOID),
  st_make_valid(access_union_all) 
) %>%
  mutate(access_area_m2_all = as.numeric(st_area(geometry))) %>%
  st_drop_geometry() %>%
  group_by(GEOID) %>%
  summarize(access_area_m2_all = sum(access_area_m2_all, na.rm = TRUE), .groups = "drop")

tracts_metrics_all <- tracts_metrics %>%
  left_join(intersect_access_all, by = "GEOID") %>%
  mutate(
    access_area_m2_all = coalesce(access_area_m2_all, 0),
    access_ratio_all   = pmin(access_area_m2_all / tract_area_m2, 1),
    z_access_all       = as.numeric(scale(-access_ratio_all)),

    score_all          = 0.5 * z_child + 0.5 * z_access_all
  ) %>%
  mutate(
    cutoff_all = quantile(score_all, 1 - p_target, na.rm = TRUE),
    vulnerable_all = score_all >= cutoff_all
  )

## 2) Compare with the previous results
sens_compare <- tracts_metrics_all %>%
  st_drop_geometry() %>%
  select(GEOID, vulnerable_core = vulnerable_local, vulnerable_all)

sens_compare <- sens_compare %>%
  mutate(
    status = dplyr::case_when(
      vulnerable_core &  vulnerable_all  ~ "Stable: Desert",
      !vulnerable_core & !vulnerable_all ~ "Stable: Not Desert",
      vulnerable_core & !vulnerable_all  ~ "Flipped: Rescued by Rec", 
      !vulnerable_core &  vulnerable_all ~ "Flipped: Newly Desert", 
      TRUE ~ "Other"
    )
  )

## 3) Summary table for sensitivity
total_tracts <- nrow(sens_compare)
sens_table <- sens_compare %>%
  count(status, name = "Count") %>%
  mutate(Share = paste0(round(100 * Count / total_tracts, 1), "%")) %>%
  arrange(desc(Count))

## 4) The degree of overlap between two sets of "deserts"
jaccard <- with(sens_compare, sum(vulnerable_core & vulnerable_all, na.rm = TRUE) /
                  sum(vulnerable_core | vulnerable_all, na.rm = TRUE))


kable(
  bind_rows(
    sens_table,
    tibble(
      status = "Jaccard (core vs all)",
      Count  = NA_integer_,
      Share  = paste0(round(jaccard * 100, 1), "%")
    )
  ),
  caption = "Sensitivity: Effect of Adding Recreation Centers on 'Educational Desert' Classification"
)
Sensitivity: Effect of Adding Recreation Centers on ‘Educational Desert’ Classification
status Count Share
Stable: Not Desert 298 73%
Stable: Desert 94 23%
Flipped: Newly Desert 8 2%
Flipped: Rescued by Rec 8 2%
Jaccard (core vs all) NA 85.5%
# Map the "flipped" tracts

## 1) reproject CRS to WGS 84
tracts_status_map <- tracts_metrics %>%
  select(GEOID, geometry) %>%
  left_join(sens_compare %>% select(GEOID, status), by = "GEOID") %>%
  st_transform(crs_map)

ggplot() +
  geom_sf(data = tracts_status_map, aes(fill = status), color = "grey30") +
  geom_sf(data = phl_boundary_plot, fill = NA, color = "black", linewidth = 0.4) +

  scale_fill_manual(
    name = "Classification change",
    values = c(
      "Stable: Desert"        = "wheat",
      "Stable: Not Desert"    = "#f7f7f7",
      "Flipped: Rescued by Rec" = "olivedrab",
      "Flipped: Newly Desert" = "#377eb8"
    )
  ) +
  scale_color_manual(
    name = "Facilities",
    values = c("Schools" = "mistyrose3", "Libraries" = "cornsilk4")
  ) +
  guides(
    color = guide_legend(override.aes = list(shape = 16, size = 3, alpha = 1))
  ) +
  labs(
    title = "Sensitivity Map: Tracts That Flip When Adding Recreation Centers",
    subtitle = "‘Rescued by Rec’ = Desert under core (schools+libraries) but NOT under all (including recreation)",
    caption = "CRS = WGS84"
  ) +
  theme_minimal() +
  theme(
    legend.position = "right",
    axis.text = element_blank(),
    axis.ticks = element_blank(),
    panel.grid = element_blank()
  )

Analysis requirements:

  • Clear code comments explaining each step

  • Appropriate CRS transformations

  • Summary statistics or counts

  • At least one map showing your findings

  • Brief interpretation of results (3-5 sentences)

Your interpretation:

  • Child density is strongly right-skewed. Most tracts have low values with a few high outliers, so when combined with low access these tails tend to rank into the highest scores.

  • Access ratio piles up at 1.0. Many tracts are fully covered; most variation lives on partially covered edge tracts.

  • Hotspots for children vs. coverage do not fully coincide, creating cross-areas of “high need × low access” that rise in the composite score.

  • Sensitivity shows stability with some churn. Stable desert 93; stable non-desert 297; flipped both ways 9 each; Jaccard overlap 83.8%. Recreation centers shift rankings mainly around the threshold.

  • Overall, educational accessibility in Philadelphia is quite strong. The histogram of access ratios shows a heavy spike at 1.0, indicating that most tracts are fully covered by at least one school or library within a 0.5-mile radius. Spatially, only a few peripheral or industrial tracts exhibit low accessibility, suggesting an overall well-distributed and accessible educational infrastructure.

However, when overlaid with child population density, a few neighborhoods—particularly parts of North Philadelphia and the southwest low-income areas-emerge as pockets of high need but low access, forming the city’s main “educational desert” zones.

Limitations: - The “desert” share is fixed at 25%, meaning results indicate relative prioritization, not the true shortage of facilities. Even if new schools are added, there will always be a top 25%.

  • Access is purely spatial—distance-based—without accounting for facility capacity, quality, or transportation barriers. It measures potential accessibility, not real educational equity.

  • Equal weights (0.5/0.5) are simple but subjective; weights should ideally reflect policy priorities or empirical evidence.

Finally - A few comments about your incorporation of feedback!

  • To make sure that I can take responsibilities to my codes, I added comment lines everywhere I might need.

  • To make sure the tables are readable for both readers and me, I visualized important results of every chunk in part 3.


Submission Requirements

What to submit:

  1. Rendered HTML document posted to your course portfolio with all code, outputs, maps, and text
    • Use embed-resources: true in YAML so it’s a single file
    • All code should run without errors
    • All maps and charts should display correctly

File naming: LastName_FirstName_Assignment2.html and LastName_FirstName_Assignment2.qmd


Source Code
---
title: "Assignment 2: Spatial Analysis and Visualization"
subtitle: "Healthcare Access and Equity in Pennsylvania"
author: "Yuqing(Demi) Yang"
date: today
format: 
  html:
    theme: cosmo
    page-layout: full
    embed-resources: true
    toc-depth: 2
    code-fold: false
    code-tools: true
    toc-location: left
execute:
  warning: false
  message: false
  echo: true
---

## Assignment Overview

**Learning Objectives:**
- Apply spatial operations to answer policy-relevant research questions
- Integrate census demographic data with spatial analysis
- Create publication-quality visualizations and maps
- Work with spatial data from multiple sources
- Communicate findings effectively for policy audiences

---

## Part 1: Healthcare Access for Vulnerable Populations

### Research Question

- **Which Pennsylvania counties have the highest proportion of vulnerable populations (elderly + low-income) living far from hospitals?**

Your analysis should identify counties that should be priorities for healthcare investment and policy intervention.

### Required Analysis Steps

Complete the following analysis, documenting each step with code and brief explanations:

#### Step 1: Data Collection (5 points)

Load the required spatial data:
- Pennsylvania county boundaries
- Pennsylvania hospitals (from lecture data)
- Pennsylvania census tracts

**Your Task:**
```{r 1.1}
# Load required packages

library(sf)
library(tidyverse)
library(tigris)
library(tidycensus)
library(scales)
library(patchwork)
library(here)
library(units)
library(kableExtra)
census_api_key("42bf8a20a3df1def380f330cf7edad0dd5842ce6")

# Load spatial data

options(tigris_progress = FALSE, tigris_use_cache = TRUE)
pa_counties <- st_read(here("assignments/assignment_2/data/Pennsylvania_County_Boundaries.shp"),quiet = TRUE)
hospitals <- st_read(here("assignments/assignment_2/data/hospitals.geojson"),quiet = TRUE)
census_tracts <- tracts(state = "PA", cb = TRUE)

# Check that all data loaded correctly

## 1)Make a table for checking

layers <- list(
  pa_counties   = pa_counties,
  hospitals     = hospitals,
  census_tracts = census_tracts
)

qc <- function(x) {
  data.frame(
    is_sf    = inherits(x, "sf"),
    n        = nrow(x),
    geom     = paste(unique(as.character(st_geometry_type(x))), collapse = ", "),
    crs_epsg = st_crs(x)$epsg,
    crs_txt  = st_crs(x)$input,
    invalid  = sum(!st_is_valid(x)),
    empty    = sum(st_is_empty(x)),
    bbox     = paste(round(st_bbox(x), 3), collapse = ", ")
  )
}

qc_table <- do.call(rbind, lapply(layers, qc)) %>%
  tibble::rownames_to_column("Layer")

kable(
  qc_table,
  caption = "Summary of Spatial Layers and Geometry Quality Checks",
  align = "c"
) %>%
  kable_styling(
    full_width = FALSE,
    bootstrap_options = c("striped", "hover", "condensed")
  )

```

**Questions to answer:**

- **How many hospitals are in your dataset?**
  - 223

- **How many census tracts?**
  - 3445

- **What coordinate reference system is each dataset in?**
  - *Pennsylvania county boundaries* is in EPSG:3857(WGS 84 / Web-Mercator, meters), *Pennsylvania hospitals* is in EPSG:4326(WGS 84 geographic, degrees), and *Pennsylvania census tracts* is in EPSG:4269(NAD83 geographic, degrees).

---

#### Step 2: Get Demographic Data 

Use `tidycensus` to download tract-level demographic data for Pennsylvania.

**Required variables:**
- Total population
- Median household income
- Population 65 years and over (you may need to sum multiple age categories)

**Your Task:**
```{r 1.2}
# Get demographic data from ACS

## 1)Reach the data
core_vars <- c(
  total_pop     = "B01003_001",
  median_income = "B19013_001"
)

acs_core <- get_acs(
  geography = "tract",
  state     = "PA",
  variables = core_vars,
  year      = 2022,
  survey    = "acs5",
  output    = "wide",
  geometry  = FALSE
) %>%
  transmute(
    GEOID,
    total_pop     = total_popE,
    median_income = median_incomeE
  )

## 2) Calculate the population 65 years and over
age65_vars <- c(sprintf("B01001_%03d", 20:25), sprintf("B01001_%03d", 44:49))

acs_65_wide <- get_acs(
  geography = "tract", state = "PA", variables = age65_vars,
  year = 2022, survey = "acs5"
) %>%
  select(GEOID, variable, estimate) %>%
  pivot_wider(names_from = variable, values_from = estimate)

acs_65plus <- acs_65_wide %>%
  mutate(
    non_na_bins = rowSums(!is.na(across(all_of(age65_vars)))),
    over65 = if_else(
      non_na_bins == 0, NA_real_,
      rowSums(across(all_of(age65_vars)), na.rm = TRUE)
    )
  ) %>%
  select(GEOID, over65)

# Join to tract boundaries
options(tigris_progress = FALSE, tigris_use_cache = TRUE)

pa_tracts <- tracts(state = "PA", year = 2022, cb = TRUE, class = "sf")

pa_tracts_joined <- pa_tracts %>%
  left_join(acs_core,   by = "GEOID") %>%
  left_join(acs_65plus, by = "GEOID") %>%
  mutate(
    over65_share = if_else(!is.na(total_pop) & total_pop > 0, over65 / total_pop, NA_real_)
  )

## 2) Count the missing data and medium data of income
income_na_n   <- sum(is.na(pa_tracts_joined$median_income))
income_median_overall <- median(pa_tracts_joined$median_income, na.rm = TRUE)

## 3) Remove missing values
pa_tracts_demo <- pa_tracts_joined %>%
  filter(!is.na(median_income), !is.na(over65_share))

```

**Questions to answer:**

- **What year of ACS data are you using?**
  - 2022

- **How many tracts have missing income data?**
  - 62. Mostly because household = 0, or household is extremely small.

- **What is the median income across all PA census tracts?**
  - 70188

---

#### Step 3: Define Vulnerable Populations 

Identify census tracts with vulnerable populations based on TWO criteria:
1. Low median household income (choose an appropriate threshold)
2. Significant elderly population (choose an appropriate threshold)

**Your Task:**
```{r 1.3}
# Filter for vulnerable tracts based on your criteria

## 1) thresholds
income_threshold  <- quantile(pa_tracts_demo$median_income, 0.25, na.rm = TRUE, names = FALSE)
elderly_q75       <- quantile(pa_tracts_demo$over65_share, 0.75, na.rm = TRUE, names = FALSE)

## 2) flags + vulnerable
pa_tracts_demo <- pa_tracts_demo %>%
  mutate(
    low_income   = median_income <= income_threshold,
    high_elderly = over65_share >= elderly_q75,  
    vulnerable   = low_income & high_elderly
  )

## 3) quick outputs
elderly_q75
sum(pa_tracts_demo$vulnerable, na.rm = TRUE)
round(mean(pa_tracts_demo$vulnerable, na.rm = TRUE) * 100, 1)

```

**Questions to answer:**

- **What income threshold did you choose and why?**
  - The low-income threshold = tract median income ≤ state 25th percentile.A percentile-based, within-state threshold is robust to outliers and gives a clear “most disadvantaged” group.


- **What elderly population threshold did you choose and why?**
  - The aging threshold = 65+ share ≥ state 75th percentile.Using percentile cutoffs to identify the “upper tail” aligns with screening practice.Both thresholds follow a percentile-ranking approach widely used by *federal screening tools (SVI/EJScreen)*.

- **How many tracts meet your vulnerability criteria?**
  - 165

- **What percentage of PA census tracts are considered vulnerable by your definition?**
  - 4.9%

---

#### Step 4: Calculate Distance to Hospitals 

For each vulnerable tract, calculate the distance to the nearest hospital.

**Your Task:**
```{r 1.4}
# Transform to appropriate projected CRS

## 1) Copy and reproject (without modifying the original object)
distance_crs <- 5070
pa_tracts_demo_5070 <- st_transform(pa_tracts_demo, distance_crs)
hospitals_5070 <- st_transform(hospitals, distance_crs)

## 2) Keep only vulnerable tracts
vuln_5070 <- pa_tracts_demo_5070 %>%
  dplyr::filter(vulnerable %in% TRUE)

stopifnot(nrow(hospitals_5070) > 0)
stopifnot(nrow(vuln_5070) > 0)

# Calculate distance from each tract centroid to nearest hospital

## 1) Use an interior representative point
vuln_pts_5070 <- st_centroid(vuln_5070)

## 2) Pairwise distance matrix (meters)
dist_mat_m <- st_distance(vuln_pts_5070, hospitals_5070)

## 3) Distance: tract centroid -> nearest hospital
nearest_idx <- apply(dist_mat_m, 1, function(x) which.min(as.numeric(x)))
min_dist_m <- apply(dist_mat_m, 1, function(x) min(as.numeric(x), na.rm = TRUE))

## 4) meters → miles
dist_mi <- min_dist_m / 1609.344

## 5) Attach back to vulnerable tracts
vulnerable_tracts_dist_5070 <- vuln_5070 %>%
  mutate(
    nearest_hospital_row = nearest_idx,
    dist_to_hospital_mi  = dist_mi)


avg_dist_mi  <- mean(vulnerable_tracts_dist_5070$dist_to_hospital_mi, na.rm = TRUE)
max_dist_mi  <- max(vulnerable_tracts_dist_5070$dist_to_hospital_mi, na.rm = TRUE)
over15_count <- sum(vulnerable_tracts_dist_5070$dist_to_hospital_mi > 15, na.rm = TRUE)

cat(
  "Average distance to nearest hospital (miles): ", round(avg_dist_mi, 2), "\n",
  "Maximum distance (miles): ",                   round(max_dist_mi, 2), "\n",
  "Vulnerable tracts > 15 miles: ",               over15_count, "\n",
  sep = ""
)
```

**Requirements:**

- **Use an appropriate projected coordinate system for Pennsylvania**
- **Calculate distances in miles**
- **Explain why you chose your projection**
  - I use EPSG:5070 (NAD83 / Conus Albers) because it provides low distortion across the conterminous U.S., so distances computed statewide in Pennsylvania are much more reliable than in Web Mercator (distorts scale) or unprojected WGS84 (degrees). It also uses meters, making distance and area calculations straightforward, and it avoids zone-splitting issues you’d hit with StatePlane or UTM.

**Questions to answer:**

- **What is the average distance to the nearest hospital for vulnerable tracts?**
  - 4.75
- **What is the maximum distance?**
  - 19.32
- **How many vulnerable tracts are more than 15 miles from the nearest hospital?**
  - 10

---

#### Step 5: Identify Underserved Areas 

Define "underserved" as vulnerable tracts that are more than 15 miles from the nearest hospital.

**Your Task:**
```{r 1.5}
# Create underserved variable

vulnerable_tracts_dist_5070 <- vulnerable_tracts_dist_5070 %>%
  dplyr::mutate(underserved = dist_to_hospital_mi > 15)


sum(vulnerable_tracts_dist_5070$underserved, na.rm = TRUE)
round(mean(vulnerable_tracts_dist_5070$underserved, na.rm = TRUE) * 100, 1)


```

**Questions to answer:**

- **How many tracts are underserved?**
  - 10

- **What percentage of vulnerable tracts are underserved?**
  - 6.1%
  
- **Does this surprise you? Why or why not?**
  - Hospitals cluster in metro areas, so it’s expected that some rural, low-density tracts fall farther from facilities; under my definition, only ~6% of vulnerable tracts are underserved, which suggests overall access is fairly good. That said, pockets of concern remain, and distance alone doesn’t capture barriers like transportation or service capacity.

---

#### Step 6: Aggregate to County Level

Use spatial joins and aggregation to calculate county-level statistics about vulnerable populations and hospital access.

**Your Task:**
```{r 1.6}
# Spatial join tracts to counties

pa_counties_5070 <- st_transform(pa_counties, 5070)

vuln_with_county <- vulnerable_tracts_dist_5070 %>%
  st_join(pa_counties_5070 %>% select(COUNTY_NAM))

# Aggregate statistics by county

county_summary <- vuln_with_county %>%
  st_drop_geometry() %>%
  group_by(COUNTY_NAM) %>%
  summarise(
    vulnerable_tracts          = n(),
    underserved_tracts         = sum(underserved, na.rm = TRUE),
    pct_underserved            = 100 * underserved_tracts / vulnerable_tracts,
    avg_dist_mi_vulnerable     = mean(dist_to_hospital_mi, na.rm = TRUE),
    vulnerable_population      = sum(total_pop, na.rm = TRUE),
    vulnerable_pop_far_mi15    = sum(ifelse(underserved, total_pop, 0), na.rm = TRUE),
    .groups = "drop")
```


```{r 1.6.1}
county_summary %>%
  arrange(desc(pct_underserved)) %>%
  select(COUNTY_NAM, vulnerable_tracts, underserved_tracts, pct_underserved) %>%
  slice_head(n = 5)

county_summary %>%
  arrange(desc(vulnerable_pop_far_mi15)) %>%
  select(COUNTY_NAM, vulnerable_population, vulnerable_pop_far_mi15) %>%
  slice_head(n = 5) 

```

**Required county-level statistics:**

- Number of vulnerable tracts
- Number of underserved tracts  
- Percentage of vulnerable tracts that are underserved
- Average distance to nearest hospital for vulnerable tracts
- Total vulnerable population

**Questions to answer:**

- **Which 5 counties have the highest percentage of underserved vulnerable tracts?**
  - BRADFORD,COLUMBIA,JUNIATA,MONROE,PERRY
- **Which counties have the most vulnerable people living far from hospitals?**
  - PERRY(5800),BRADFORD(5466),CLEARFIELD(5189),DAUPHIN(4018),NORTHUMBERLAND(4018)
- **Are there any patterns in where underserved counties are located?**
  - Underserved counties tend to cluster in rural and lower-density parts of Pennsylvania, where hospital coverage is sparse and facilities are concentrated in metro hubs. You’ll often see higher shares along the northern/central rural corridor and Appalachian areas, reflecting longer travel distances and thinner provider networks relative to population.

---

#### Step 7: Create Summary Table 

Create a professional table showing the top 10 priority counties for healthcare investment.

**Your Task:**
```{r 1.7}
# Create and format priority counties table

priority_tbl <- county_summary %>%
  mutate(
    # safety: replace any NAs used in scoring
    across(c(vulnerable_pop_far_mi15, pct_underserved, avg_dist_mi_vulnerable,
             vulnerable_population, vulnerable_tracts, underserved_tracts),
           ~ tidyr::replace_na(., 0)),
    # Priority metric
    priority_score = 0.60 * rescale(vulnerable_pop_far_mi15) +
                     0.25 * rescale(pct_underserved) +
                     0.15 * rescale(avg_dist_mi_vulnerable)
  ) %>%
  arrange(desc(priority_score)) %>%
  slice_head(n = 10) %>%
  transmute(
    County                         = COUNTY_NAM,
    `Vulnerable tracts`            = vulnerable_tracts,
    `Underserved tracts`           = underserved_tracts,
    `% vulnerable underserved`     = percent(pct_underserved / 100, accuracy = 0.1),
    `Avg distance (mi)`            = number(avg_dist_mi_vulnerable, accuracy = 0.1),
    `Vulnerable people >15 mi`     = comma(vulnerable_pop_far_mi15),
    `Priority score`               = number(priority_score, accuracy = 0.01)
  )

# Nicely formatted table
knitr::kable(
  priority_tbl,
  caption = "Top 10 Priority Pennsylvania Counties for Healthcare Investment
  (Priority Score weights: 60% vulnerable people >15 mi, 25% % vulnerable underserved, 15% average distance)",
  booktabs = TRUE
) |>
  kableExtra::kable_styling(full_width = FALSE, font_size = 12)


```

**Requirements:**
- Use `knitr::kable()` or similar for formatting
- Include descriptive column names
- Format numbers appropriately (commas for population, percentages, etc.)
- Add an informative caption
- Sort by priority (you decide the metric)

---

## Part 2: Comprehensive Visualization 

Using the skills from Week 3 (Data Visualization), create publication-quality maps and charts.

### Map 1: County-Level Choropleth 

Create a choropleth map showing healthcare access challenges at the county level.

**Your Task:**
```{r 2.1}
# Create county-level access map

# 1) Reproject display layers to WGS84 (EPSG:4326)
crs_display <- 4326
pa_counties_4326 <- st_transform(pa_counties, crs_display)
hospitals_4326   <- st_transform(hospitals, crs_display)

# 2) Join county stats to county geometries
county_map_4326 <- pa_counties_4326 %>%
  left_join(
    county_summary %>%
      mutate(pct_underserved_prop = pct_underserved / 100), 
    by = "COUNTY_NAM"
  )

# 3) Choropleth (WGS84) + hospital points
ggplot(county_map_4326) +
  geom_sf(aes(fill = pct_underserved_prop), color = "white", linewidth = 0.2) +
  geom_sf(data = hospitals_4326, shape = 21, fill = "white", color = "black",
          size = 1.1, alpha = 0.8, inherit.aes = FALSE) +
  scale_fill_viridis_c(
    option = "cividis",
    na.value = "grey90",
    name = "Underserved share\n(of vulnerable tracts)",
    labels = label_percent(accuracy = 1)
  ) +
  labs(
    title    = "Healthcare Access Challenges in Pennsylvania",
    subtitle = "Counties filled by % of vulnerable tracts > 15 miles from a hospital",
    caption  = "Sources: ACS 5-year (2022), lecture hospital dataset. Distances computed in EPSG:5070; map shown in WGS84."
  ) +
  theme_void() +
  theme(
    legend.position = "right",
    plot.title      = element_text(face = "bold"),
    plot.caption    = element_text(hjust = 0)
  )

```

**Requirements:**
- Fill counties by percentage of vulnerable tracts that are underserved
- Include hospital locations as points
- Use an appropriate color scheme
- Include clear title, subtitle, and caption
- Use `theme_void()` or similar clean theme
- Add a legend with formatted labels

---

### Map 2: Detailed Vulnerability Map 

Create a map highlighting underserved vulnerable tracts.

**Your Task:**
```{r 2.2}
# Create detailed tract-level map

## 1)Ensure "underserved" exists; keep only vulnerable tracts and make a status label
vuln_dist_4326 <- vulnerable_tracts_dist_5070 %>%
  mutate(underserved = if (!"underserved" %in% names(.)) dist_to_hospital_mi > 15 else underserved) %>%
  st_transform(crs_display) %>%
  mutate(
    status = ifelse(underserved, "Underserved vulnerable", "Other vulnerable")
  )

## 2)Map
tracts_4326 <- sf::st_transform(census_tracts, 4326)
ggplot() +
  # county boundaries (context; subtle)
  geom_sf(data = pa_counties_4326, fill = NA, color = "grey40", linewidth = 0.3) +
  
  # tracts
  geom_sf(data = tracts_4326, fill = NA, color = "grey80", linewidth = 0.05) +
  
  # vulnerable tracts
  geom_sf(data = subset(vuln_dist_4326, status == "Other vulnerable"),
  aes(fill = status), color = "black", linewidth = 0.15, alpha = 0.9) +
  
  # underserved vulnerable tracts
  geom_sf(data = subset(vuln_dist_4326, status == "Underserved vulnerable"),
  aes(fill = status), color = "black", linewidth = 0.15, alpha = 0.9) +
  
  # hospitals (points)
  geom_sf(
  data = hospitals_4326,
  aes(shape = "Hospitals"),
  fill = "white", color = "black",
  size = 1.2, alpha = 0.9,
  inherit.aes = FALSE,
  show.legend = "point"
) +
  
  # fills for tracts
  scale_fill_manual(
    name = NULL,
    values = c("Other vulnerable" = "#9ecae1", "Underserved vulnerable" = "#08519c")
  ) +
  # shape for hospitals + legend tweaks
  scale_shape_manual(name = NULL, values = c("Hospitals" = 21)) +
  guides(
    fill  = guide_legend(order = 1, override.aes = list(shape = NA)),
    shape = guide_legend(order = 2, override.aes = list(fill = "white", color = "black", size = 2))
  ) +
  labs(
    title    = "Underserved Vulnerable Census Tracts in Pennsylvania",
    subtitle = "Underserved = vulnerable tracts (>15 miles to nearest hospital); counties and tract grids shown for context",
    caption  = "Sources: ACS 5-year (2022); hospital dataset. Distances: EPSG:5070; map: WGS84."
  ) +
  theme_void() +
  theme(legend.position = "right", plot.title = element_text(face = "bold"))

```

**Requirements:**

- Show underserved vulnerable tracts in a contrasting color

- Include county boundaries for context

- Show hospital locations

- Use appropriate visual hierarchy (what should stand out?)

- Include informative title and subtitle

---

### Chart: Distribution Analysis

Create a visualization showing the distribution of distances to hospitals for vulnerable populations.

**Your Task:**
```{r}
# Create distribution visualization

## A small helper used in captions
median_dist <- median(vulnerable_tracts_dist_5070$dist_to_hospital_mi, na.rm = TRUE)

## 1) People-weighted histogram of distances (highlighting the 15-mile threshold)
vulnerable_tracts_dist_5070 |> 
  mutate(beyond15 = factor(dist_to_hospital_mi > 15,
                           levels = c(FALSE, TRUE),
                           labels = c("≤ 15 miles", "> 15 miles"))) |>
  ggplot(aes(x = dist_to_hospital_mi, weight = total_pop, fill = beyond15)) +
  geom_histogram(binwidth = 1, boundary = 0, color = "white") +
  scale_fill_manual(values = c("≤ 15 miles" = "#9ecae1", "> 15 miles" = "#08519c"),
                    name = NULL) +
  scale_y_continuous(labels = comma) +
  labs(
    title = "How far are vulnerable residents from hospitals?",
    x = "Distance to nearest hospital (miles)",
    y = "Residents (people)"
  ) +
  theme_minimal(base_size = 12) +
  theme(legend.position = "right")

# 2) Scatter: Distance vs. tract population (vulnerable tracts)
ggplot(vulnerable_tracts_dist_5070,
       aes(x = dist_to_hospital_mi, y = total_pop, color = underserved)) +
  geom_point(alpha = 0.7) +
  scale_y_continuous(labels = comma, trans = "log10") +
  scale_color_manual(values = c(`FALSE` = "#9ecae1", `TRUE` = "#08519c"),
                     name = "Underserved\n(>15 mi)") +
  labs(
    title = "Distance vs. population size across vulnerable tracts",
    x = "Distance to nearest hospital (miles)",
    y = "Tract population (people)"
  ) +
  theme_minimal(base_size = 12) +
  theme(legend.position = "right")



```

**Suggested chart types:**

- Histogram or density plot of distances

- Box plot comparing distances across regions

- Bar chart of underserved tracts by county

- Scatter plot of distance vs. vulnerable population size

**Requirements:**

- Clear axes labels with units

- Appropriate title

- Professional formatting

- Brief interpretation (1-2 sentences as a caption or in text)
  - For histogram: Most vulnerable residents live within about %.1f miles; the right tail beyond 15 miles is relatively small.
  - For scatter plot: Most high-population vulnerable tracts are within ~15 miles; the far-distance outliers tend to be lower-density.

---

## Part 3: Bring Your Own Data Analysis 

Choose your own additional spatial dataset and conduct a supplementary analysis.

### Challenge Options

Choose ONE of the following challenge exercises, or propose your own research question using OpenDataPhilly data (https://opendataphilly.org/datasets/). 

---

#### Education & Youth Services

**Option A: Educational Desert Analysis**
- **Data:** Schools, Libraries, Recreation Centers, Census tracts (child population)
- **Question:** "Which neighborhoods lack adequate educational infrastructure for children?"
- **Operations:** Buffer schools/libraries (0.5 mile walking distance), identify coverage gaps, overlay with child population density
- **Policy relevance:** School district planning, library placement, after-school program siting

**Option B: School Safety Zones**
- **Data:** Schools, Crime Incidents, Bike Network
- **Question:** "Are school zones safe for walking/biking, or are they crime hotspots?"
- **Operations:** Buffer schools (1000ft safety zone),spatial join with crime incidents, assess bike infrastructure coverage
- **Policy relevance:** Safe Routes to School programs, crossing guard placement

---

#### Environmental Justice

**Option C: Green Space Equity** 
- **Data:** Parks, Street Trees, Census tracts (race/income demographics)
- **Question:** "Do low-income and minority neighborhoods have equitable access to green space?"
- **Operations:** Buffer parks (10-minute walk = 0.5 mile), calculate tree canopy or park acreage per capita, compare by demographics
- **Policy relevance:** Climate resilience, environmental justice, urban forestry investment
---

#### Public Safety & Justice

**Option D: Crime & Community Resources**
- **Data:** Crime Incidents, Recreation Centers, Libraries, Street Lights
- **Question:** "Are high-crime areas underserved by community resources?"
- **Operations:** Aggregate crime counts to census tracts or neighborhoods, count community resources per area, spatial correlation analysis
- **Policy relevance:** Community investment, violence prevention strategies
---

#### Infrastructure & Services

**Option E: Polling Place Accessibility**
- **Data:** Polling Places, SEPTA stops, Census tracts (elderly population, disability rates)
- **Question:** "Are polling places accessible for elderly and disabled voters?"
- **Operations:** Buffer polling places and transit stops, identify vulnerable populations, find areas lacking access
- **Policy relevance:** Voting rights, election infrastructure, ADA compliance

---

#### Health & Wellness

**Option F: Recreation & Population Health**
- **Data:** Recreation Centers, Playgrounds, Parks, Census tracts (demographics)
- **Question:** "Is lack of recreation access associated with vulnerable populations?"
- **Operations:** Calculate recreation facilities per capita by neighborhood, buffer facilities for walking access, overlay with demographic indicators
- **Policy relevance:** Public health investment, recreation programming, obesity prevention

---

#### Emergency Services

**Option G: EMS Response Coverage**
- **Data:** Fire Stations, EMS stations, Population density, High-rise buildings
- **Question:** "Are population-dense areas adequately covered by emergency services?"
- **Operations:** Create service area buffers (5-minute drive = ~2 miles), assess population coverage, identify gaps in high-density areas
- **Policy relevance:** Emergency preparedness, station siting decisions

---

#### Arts & Culture

**Option H: Cultural Asset Distribution**
- **Data:** Public Art, Museums, Historic sites/markers, Neighborhoods
- **Question:** "Do all neighborhoods have equitable access to cultural amenities?"
- **Operations:** Count cultural assets per neighborhood, normalize by population, compare distribution across demographic groups
- **Policy relevance:** Cultural equity, tourism, quality of life, neighborhood identity

---

### Data Sources

**OpenDataPhilly:** https://opendataphilly.org/datasets/
- Most datasets available as GeoJSON, Shapefile, or CSV with coordinates
- Always check the Metadata for a data dictionary of the fields.

**Additional Sources:**
- **Pennsylvania Open Data:** https://data.pa.gov/
- **Census Bureau (via tidycensus):** Demographics, economic indicators, commute patterns
- **TIGER/Line (via tigris):** Geographic boundaries

### Recommended Starting Points

**If you're feeling confident:** Choose an advanced challenge with multiple data layers. 
**If you are a beginner, choose something more manageable that helps you understand the basics**


**If you have a different idea:** Propose your own question! Just make sure:
- You can access the spatial data
- You can perform at least 2 spatial operations

### My Analysis

**My Task:**

1. **Find and load additional data**
   - Document your data source
   - Check and standardize the CRS
   - Provide basic summary statistics

```{r 3.1}
# Load your additional dataset

## 1) Philly tracts

options(tigris_progress = FALSE, tigris_use_cache = TRUE)
phl_tracts <- tracts(state = "PA", county = "Philadelphia", cb = TRUE, year = 2022)

phl_boundary <- counties(state = "PA", cb = TRUE, year = 2022) %>%
  filter(NAME == "Philadelphia")

## 2)Reach children data
phl_children <- get_acs(
  geography = "tract",
  variables = c(child_pop = "B09001_001"),
  state = "PA",
  county = "Philadelphia",
  year = 2022,
  output = "wide"
)

tracts_with_children <- phl_tracts %>%
  left_join(phl_children, by = "GEOID")

## 3)Reach POI
### recreation center raw data(philly)
PPR_Program_sites <- st_read("https://opendata.arcgis.com/api/v3/datasets/9eb26a787a6e448ba426eea7f9f0d93a_0/downloads/data?format=geojson&spatialRefId=4326", quiet = TRUE)

### schools raw data(philly)
schools <- st_read("https://hub.arcgis.com/api/v3/datasets/d46a7e59e2c246c891fbee778759717e_0/downloads/data?format=geojson&spatialRefId=4326&where=1%3D1", quiet = TRUE)

### libraries raw data(global)
libraries <- st_read( "data/Libraries/Libraries.gpkg",layer = "parsed", quiet = TRUE)

## 4) Check the data

summary_rows <- tibble::tibble(
  Dataset = c("Census Tracts (phl_tracts)", 
              "Children ACS (phl_children)", 
              "Schools", 
              "Libraries"),
  Rows = c(
    nrow(phl_tracts),
    nrow(phl_children),
    nrow(schools),
    nrow(libraries)
  )
)

kable(
  summary_rows,
  caption = "Summary of Raw Datasets Loaded",
  align = "c"
) %>%
  kable_styling(
    full_width = FALSE,
    bootstrap_options = c("striped", "hover", "condensed"),
    position = "center"
  )

```
```{r 3.1.1,results='hide'}
# Data cleaning
## 1)Recreation center
names(PPR_Program_sites)

### Check the unique values containing "Recreation" in the field and see how they are written
unique(unlist(PPR_Program_sites))
unique(PPR_Program_sites$FACILITY_TYPE)

### Filter"Recreation Centers"
recreation_centers <- PPR_Program_sites %>%
  filter(
    grepl("recreation", park_name, ignore.case = TRUE)
  )

### See results
nrow(recreation_centers)
head(recreation_centers)

## 2)Libraries
### reproject CRS of libraries to boundary's 
st_crs(libraries)
st_crs(phl_boundary)
libraries <- st_transform(libraries, st_crs(phl_boundary))

### Cut out the libraries within the Philly area
libraries_phl <- st_intersection(libraries, phl_boundary)

### See results
nrow(libraries_phl)
head(libraries_phl)


## 3)Summary table
summary_table <- tibble(
  Dataset = c("Libraries", 
              "Schools", 
              "Recreation Centers"),
  Features = c(nrow(libraries_phl), 
               nrow(schools), 
               nrow(recreation_centers)),
  CRS = c(st_crs(libraries)$input, 
          st_crs(schools)$input, 
          st_crs(PPR_Program_sites)$input)
)

kable(summary_table, caption = "Summary of Datasets Used")

```

**Questions to answer:**

- **What dataset did you choose and why?**
  - I selected three datasets that represent educational and community infrastructure in Philadelphia: Libraries; Schools; Recreation Centers.
  
- **What is the data source and date?**
  - Libraries: Extracted from OpenStreetMap (via global GeoPackage export, likely updated in 2024–2025).
  - Schools: From OpenDataPhilly(https://opendataphilly.org/datasets/schools/?utm_source=chatgpt.com) – Schools dataset, maintained by the City of Philadelphia.
  - Recreation Centers: From Philadelphia Parks & Recreation Program Sites dataset, also hosted on OpenDataPhilly.
  - All datasets were accessed in October. 2025.
  
- **How many features does it contain?**
  - For libraries data, it has 81082 features at beginning, containing the libraries all over the world. After filter the libraries in Philly, the number of features shrinks to 80.
  - For schools, there are 495 schools(not including universities and colleges) in Philly.
  - For recreation centers, there are 67 of them in Philly.
  
- **What CRS is it in? Did you need to transform it?**
  - Libraries data is in NAD 83.
  - Schools data is in WGS 84.
  - Recreations data is in WGS 84.

---

2. **Pose a research question**

**Write a clear research statement that your analysis will answer.**

- Which neighborhoods in Philadelphia lack adequate educational infrastructure for children?

---

3. **Conduct spatial analysis**

Use at least TWO spatial operations to answer your research question.

**Required operations (choose 2+):**
- Buffers
- Spatial joins
- Spatial filtering with predicates
- Distance calculations
- Intersections or unions
- Point-in-polygon aggregation

**Your Task:**
```{r 3.2,results='hide'}
# Unify the coordinates and prepare the buffer

## 1) Reproject all layers to 5070
crs_analysis <- 5070
crs_map      <- 4326
half_mile_m  <- set_units(0.3, "mile") %>% set_units("m") %>% drop_units()

phl_tracts_5070        <- st_transform(phl_tracts, crs_analysis)
phl_boundary_5070      <- st_transform(phl_boundary, crs_analysis)
schools_5070           <- st_transform(schools, crs_analysis)
recreation_centers_5070<- st_transform(recreation_centers, crs_analysis)
libraries_phl_5070     <- st_transform(libraries_phl, crs_analysis)


## 2) Create buffer(schools & libraries, recreation centers do sensitivity)
buf_schools   <- st_buffer(schools_5070,  dist = half_mile_m)
buf_libraries <- st_buffer(libraries_phl_5070, dist = half_mile_m)

## 3) Accessible region
access_union_core <- st_union(st_union(buf_schools), st_union(buf_libraries))

## 4) Use recreation center to sensitivity comparison
buf_recreation <- st_buffer(recreation_centers_5070, dist = half_mile_m)
access_union_all <- st_union(access_union_core, st_union(buf_recreation))

## 5)Check the result
st_crs(schools_5070)
st_crs(libraries_phl_5070)
st_crs(phl_tracts_5070)

ggplot() +
  geom_sf(data = phl_boundary_5070, fill = NA, color = "black") +
  geom_sf(data = buf_schools, fill = "darkslateblue", alpha = 0.3) +
  geom_sf(data = buf_libraries, fill = "steelblue", alpha = 0.3) +
  labs(title = "0.5-mile Buffers Around Schools and Libraries")


```

```{r 3.3}
# Calculate the "education coverage rate" and "child density" of each census tracts

## 1) Area (m²) and child density (per square kilometer)

tracts_metrics <- phl_tracts_5070 %>%
  mutate(tract_area_m2 = as.numeric(st_area(geometry)),
         tract_area_km2 = tract_area_m2 / 1e6) %>%
  left_join(tracts_with_children %>% st_drop_geometry() %>%
              select(GEOID, child_pop = child_popE), by = "GEOID") %>%
  mutate(child_density = ifelse(tract_area_km2 > 0, child_pop / tract_area_km2, NA_real_))

## 2) Coverage rate = Intersection area between the reachable zone and the census zone (m²)
intersect_access <- st_intersection(
  tracts_metrics %>% select(GEOID),
  st_make_valid(access_union_core)
) %>%
  mutate(access_area_m2 = as.numeric(st_area(geometry))) %>%
  st_drop_geometry() %>%
  group_by(GEOID) %>%
  summarize(access_area_m2 = sum(access_area_m2, na.rm = TRUE), .groups = "drop")

tracts_metrics <- tracts_metrics %>%
  left_join(intersect_access, by = "GEOID") %>%
  mutate(access_area_m2 = coalesce(access_area_m2, 0),
         access_ratio   = pmin(access_area_m2 / tract_area_m2, 1))

## 3)Check the statistical distribution of child_density and access_ratio
summary(tracts_metrics[, c("child_density", "access_ratio")])

ggplot(tracts_metrics) +
  geom_histogram(aes(x = child_density), bins = 30, fill = "lightblue", color = "white") +
  labs(title = "Distribution of Child Density (children/km²)")

ggplot(tracts_metrics) +
  geom_histogram(aes(x = access_ratio), bins = 30, fill = "darkseagreen3", color = "white") +
  labs(title = "Distribution of Access Ratio (coverage rate)")

ggplot() +
  geom_sf(data = tracts_metrics, aes(fill = child_density), color = NA) +
  scale_fill_viridis_c(option="mako", direction=-1) +
  labs(title = "Child Density by Census Tract")

ggplot() +
  geom_sf(data = tracts_metrics, aes(fill = access_ratio), color = NA) +
  scale_fill_viridis_c(option="mako", direction=-1) +
  labs(title = "Education Access Ratio by Census Tract")


```

```{r 3.4}
# Define education desert

p_target <- 0.25

tracts_metrics <- tracts_metrics %>%
  mutate(
    z_child   = as.numeric(scale(child_density)), 
    z_access  = as.numeric(scale(-access_ratio)),
    score     = 0.5 * z_child + 0.5 * z_access 
  ) %>%
  mutate(
    cutoff = quantile(score, 1 - p_target, na.rm = TRUE),
    vulnerable_local = score >= cutoff
  )


cutoff_value <- unique(tracts_metrics$cutoff)
count_table  <- table(tracts_metrics$vulnerable_local)
percent_vuln <- mean(tracts_metrics$vulnerable_local, na.rm = TRUE)

summary_vulnerable <- tibble(
  Metric = c("Cutoff Score", 
             "Tracts classified as TRUE (Educational Desert)", 
             "Tracts classified as FALSE",
             "Share of Desert Tracts"),
  Value = c(
    round(cutoff_value, 3),
    as.integer(count_table["TRUE"]),
    as.integer(count_table["FALSE"]),
    paste0(round(percent_vuln * 100, 1), "%")
  )
)

kable(summary_vulnerable, caption = "Summary of Educational Desert Classification")

```

```{r 3.5}
# Draw the map

## 1) Back to WGS84
tracts_map           <- st_transform(tracts_metrics, crs_map)
phl_boundary_plot    <- st_transform(phl_boundary_5070, crs_map)
schools_plot         <- st_transform(schools, crs_map) %>% mutate(Type = "Schools")
libraries_plot       <- st_transform(libraries_phl, crs_map) %>% mutate(Type = "Libraries")

facilities <- dplyr::bind_rows(
  dplyr::select(schools_plot,  Type),
  dplyr::select(libraries_plot, Type)
)

## 2) map
ggplot() +
  geom_sf(
    data = tracts_map, 
    aes(fill = vulnerable_local), 
    color = "grey20", size = 0.15
  ) +
  geom_sf(
    data = phl_boundary_plot, 
    fill = NA, color = "black", linewidth = 0.4
  ) +
  geom_sf(
    data = facilities,
    aes(color = Type),
    shape = 16,
    size = 1,
    alpha = 0.3,
    show.legend = TRUE
  ) +
  scale_fill_manual(
    values = c("FALSE" = "whitesmoke", "TRUE" = "wheat"),
    name = "Educational desert",
    labels = c("False", "True")
  ) +
  scale_color_manual(
    name = "Facilities",
    values = c("Schools" = "sienna", "Libraries" = "cornsilk4", alpha = 0.3)
  ) +
  guides(
    color = guide_legend(override.aes = list(shape = 16, size = 3, alpha = 1))
  ) +
  labs(
    title = "Educational Desert in Philadelphia",
    subtitle = "Areas with high child density and low coverage of schools & libraries",
    caption = "CRS for map = WGS84"
  ) +
  theme_minimal() +
  theme(
    legend.position = "right",
    axis.text = element_blank(),
    axis.ticks = element_blank(),
    panel.grid = element_blank()
  )

legend_table <- tibble(
  Category = c("Educational Desert", "Schools", "Libraries"),
  Meaning = c("High child density, low access", 
              "Education facilities", 
              "Public libraries providing learning access")
)
kable(legend_table, caption = "Legend Explanation")
```

```{r 3.6}
# Sensitivity test (add Recreation Centers)

## 1) Calculate the coverage rate after adding recreation centers"
intersect_access_all <- st_intersection(
  tracts_metrics %>% select(GEOID),
  st_make_valid(access_union_all) 
) %>%
  mutate(access_area_m2_all = as.numeric(st_area(geometry))) %>%
  st_drop_geometry() %>%
  group_by(GEOID) %>%
  summarize(access_area_m2_all = sum(access_area_m2_all, na.rm = TRUE), .groups = "drop")

tracts_metrics_all <- tracts_metrics %>%
  left_join(intersect_access_all, by = "GEOID") %>%
  mutate(
    access_area_m2_all = coalesce(access_area_m2_all, 0),
    access_ratio_all   = pmin(access_area_m2_all / tract_area_m2, 1),
    z_access_all       = as.numeric(scale(-access_ratio_all)),

    score_all          = 0.5 * z_child + 0.5 * z_access_all
  ) %>%
  mutate(
    cutoff_all = quantile(score_all, 1 - p_target, na.rm = TRUE),
    vulnerable_all = score_all >= cutoff_all
  )

## 2) Compare with the previous results
sens_compare <- tracts_metrics_all %>%
  st_drop_geometry() %>%
  select(GEOID, vulnerable_core = vulnerable_local, vulnerable_all)

sens_compare <- sens_compare %>%
  mutate(
    status = dplyr::case_when(
      vulnerable_core &  vulnerable_all  ~ "Stable: Desert",
      !vulnerable_core & !vulnerable_all ~ "Stable: Not Desert",
      vulnerable_core & !vulnerable_all  ~ "Flipped: Rescued by Rec", 
      !vulnerable_core &  vulnerable_all ~ "Flipped: Newly Desert", 
      TRUE ~ "Other"
    )
  )

## 3) Summary table for sensitivity
total_tracts <- nrow(sens_compare)
sens_table <- sens_compare %>%
  count(status, name = "Count") %>%
  mutate(Share = paste0(round(100 * Count / total_tracts, 1), "%")) %>%
  arrange(desc(Count))

## 4) The degree of overlap between two sets of "deserts"
jaccard <- with(sens_compare, sum(vulnerable_core & vulnerable_all, na.rm = TRUE) /
                  sum(vulnerable_core | vulnerable_all, na.rm = TRUE))


kable(
  bind_rows(
    sens_table,
    tibble(
      status = "Jaccard (core vs all)",
      Count  = NA_integer_,
      Share  = paste0(round(jaccard * 100, 1), "%")
    )
  ),
  caption = "Sensitivity: Effect of Adding Recreation Centers on 'Educational Desert' Classification"
)

```

```{r 3.7}
# Map the "flipped" tracts

## 1) reproject CRS to WGS 84
tracts_status_map <- tracts_metrics %>%
  select(GEOID, geometry) %>%
  left_join(sens_compare %>% select(GEOID, status), by = "GEOID") %>%
  st_transform(crs_map)

ggplot() +
  geom_sf(data = tracts_status_map, aes(fill = status), color = "grey30") +
  geom_sf(data = phl_boundary_plot, fill = NA, color = "black", linewidth = 0.4) +

  scale_fill_manual(
    name = "Classification change",
    values = c(
      "Stable: Desert"        = "wheat",
      "Stable: Not Desert"    = "#f7f7f7",
      "Flipped: Rescued by Rec" = "olivedrab",
      "Flipped: Newly Desert" = "#377eb8"
    )
  ) +
  scale_color_manual(
    name = "Facilities",
    values = c("Schools" = "mistyrose3", "Libraries" = "cornsilk4")
  ) +
  guides(
    color = guide_legend(override.aes = list(shape = 16, size = 3, alpha = 1))
  ) +
  labs(
    title = "Sensitivity Map: Tracts That Flip When Adding Recreation Centers",
    subtitle = "‘Rescued by Rec’ = Desert under core (schools+libraries) but NOT under all (including recreation)",
    caption = "CRS = WGS84"
  ) +
  theme_minimal() +
  theme(
    legend.position = "right",
    axis.text = element_blank(),
    axis.ticks = element_blank(),
    panel.grid = element_blank()
  )

```

**Analysis requirements:**

- Clear code comments explaining each step

- Appropriate CRS transformations

- Summary statistics or counts

- At least one map showing your findings
- Brief interpretation of results (3-5 sentences)

**Your interpretation:**

- Child density is strongly **right-skewed**. Most tracts have low values with a few high outliers, so when combined with low access these tails tend to rank into the highest scores.

- Access ratio **piles up at 1.0**. Many tracts are fully covered; most variation lives on partially covered edge tracts.

- Hotspots for children vs. coverage **do not fully coincide**, creating cross-areas of **"high need × low access"** that rise in the composite score.

- Sensitivity shows stability with some churn. Stable desert 93; stable non-desert 297; flipped both ways 9 each; Jaccard overlap 83.8%. Recreation centers shift rankings mainly around the threshold.

- Overall, educational accessibility in Philadelphia is **quite strong**. The histogram of access ratios shows a heavy spike at 1.0, indicating that most tracts are fully covered by at least one school or library within a 0.5-mile radius. Spatially, only a few **peripheral or industrial tracts** exhibit low accessibility, suggesting an overall well-distributed and accessible educational infrastructure.

However, when overlaid with child population density, a few neighborhoods—particularly parts of **North Philadelphia and the southwest low-income areas**-emerge as pockets of high need but low access, forming the city’s main "educational desert" zones.

**Limitations:**
- The “desert” share is **fixed at 25%**, meaning results indicate relative prioritization, not the true shortage of facilities. Even if new schools are added, there will always be a top 25%.

- Access is purely spatial—distance-based—without accounting for **facility capacity, quality, or transportation barriers**. It measures potential accessibility, not real educational equity.

- Equal weights (0.5/0.5) are simple but subjective; weights should ideally reflect policy priorities or empirical evidence.

## Finally - A few comments about your incorporation of feedback!

- To make sure that I can take responsibilities to my codes, I added comment lines everywhere I might need.

- To make sure the tables are readable for both readers and me, I visualized important results of every chunk in part 3.

---

## Submission Requirements

**What to submit:**

1. **Rendered HTML document posted to your course portfolio** with all code, outputs, maps, and text
   - Use `embed-resources: true` in YAML so it's a single file
   - All code should run without errors
   - All maps and charts should display correctly

**File naming:** `LastName_FirstName_Assignment2.html` and `LastName_FirstName_Assignment2.qmd`

---